Day 2 Session 1:
Dynamic spectral analysis
using Functional Principal Component Analysis (FPCA)

Introduction

Yesterday, we looked at two statistical analysis methods: linear mixed-effect models (LMEs) and generalised additive mixed-effect models (GAMMs). These techniques can, in a way, be considered top-down approaches to both static (LMEs) and dynamic analyses (GAMMs), given that predictor variables need to be specified beforehand. These models are useful when you have some knowledge about specific variables of interest.

In contrast, our studies can often be exploratory, in which case we do not have much prior knowledge regarding, e.g., what predictor variables influence the outcome variables or how predictor variables interact with each other. In other words, we would like to see what structures emerge from the data.

Today, we will take a brief look at two bottom-up, exploratory approaches to data analysis: principal component analysis (PCA) and functional principal component analysis (FPCA). These tools can be useful for (1) reducing the data dimensions to a manageable number of variables and (2) identifying the degree to which each parameter contributes to the data structure.

To contextualise today’s workshop: we actually already know that the lower three formants (F1, F2 and F3) are important dimensions for understanding L1 Japanese speakers’ production of L2 English /l/ and /ɹ/, so it might not be that interesting if we stick to these alone. For this reason, let’s extend the scope of our data analysis and take duration into account.

Below, similarly to yesterday, we’ll start with the static data analysis using PCA. FPCA is an extension of PCA, so you’ll find it easier to understand FPCA once you’ve seen PCA first.

Principal Component Analysis (PCA)

Let’s apply Principal Component Analysis (PCA) to the static data.

Preliminaries

Importing data set

Let’s import the data set. We are using the data set openly available on the Open Science Framework (OSF) repository.

# import the csv file "initial.liquid.static.csv" from the "data" directory and save it as "df_mid"
df_mid <- readr::read_csv("data/initial.liquid.static.csv")

Checking data and data wrangling

As usual, let’s check what variables are included in the data set using colnames().

# Let's check what columns the data frame contains
colnames(df_mid)
##  [1] "...1"           "file"           "speaker"        "language"      
##  [5] "duration"       "segment"        "word"           "f1"            
##  [9] "f2"             "f3"             "previous_sound" "next_sound"    
## [13] "percent"        "IsApproximant"  "IsAcoustic"     "gender"        
## [17] "omit"           "position"

Let’s also remove the irrelevant columns and recode the next_sound values into a vowel column, as has been done previously.

# Remove columns 
df_mid <- df_mid |> 
  dplyr::select(-c(IsApproximant, IsAcoustic, omit))

# Rename vowel variables
df_mid <- df_mid |> 
  dplyr::mutate(
    vowel = case_when(
      next_sound == "AE1" ~ "/æ/",
      next_sound == "IY1" ~ "/i/",
      next_sound == "UW1" ~ "/u/",
    )
  )

# z-normalise formant values for cross-speaker comparison
df_mid <- df_mid |> 
  dplyr::group_by(speaker) |> 
  dplyr::mutate(
    f1z = scale(f1),
    f2z = scale(f2),
    f3z = scale(f3)
  ) |> 
  dplyr::ungroup()

Your turn

Please write some code below to inspect the data. Try obtaining descriptive statistics (e.g., the number of tokens, participants, etc.).

# Check column names again
# ...

Data visualisation

Let’s try data visualisation. First, we’ll compare between-group differences using a combination of scatter, box and violin plots. To plot everything together, we’ll convert the data set into long format using the tidyr::pivot_longer() function.

Spectral characteristics

Violin plot

# convert the data set into a long data
df_mid_long <- df_mid |> 
  dplyr::select(-c(f1, f2, f3)) |> 
  tidyr::pivot_longer(14:16, names_to = "formant", values_to = "values")

# plot
df_mid_long |> 
  ggplot(aes(x = language, y = values)) +
  geom_jitter(aes(colour = language), alpha = 0.5) +
  geom_violin(alpha = 0.4) +
  geom_boxplot(width = 0.3, alpha = 0.4) +
  facet_grid(vowel ~ formant) +
  scale_colour_manual(values = cbPalette) +
  theme(
    strip.text.y = element_text(angle = 360)
  )

Correlation

Let’s also take a slightly different approach. The objective of PCA is to find the straight lines (axes) along which the data show the greatest variance. To inspect the variance in the data, we’ll plot the relationships between the three spectral dimensions.

# F1 and F2
df_mid |> 
  ggplot(aes(x = f1z, y = f2z)) +
  geom_point(aes(colour = segment), alpha = 0.4) +
  scale_colour_manual(values = cbPalette) +
  facet_wrap(~ language)

# F2 and F3
df_mid |> 
  ggplot(aes(x = f2z, y = f3z)) +
  geom_point(aes(colour = segment), alpha = 0.4) +
  scale_colour_manual(values = cbPalette) +
  facet_wrap(~ language)

# F1 and F3
df_mid |> 
  ggplot(aes(x = f1z, y = f3z)) +
  geom_point(aes(colour = segment), alpha = 0.4) +
  scale_colour_manual(values = cbPalette) +
  facet_wrap(~ language)

As expected, English /l/ and /ɹ/ can be reliably distinguished along the F3 dimension, such that English /ɹ/ shows lower F3 than English /l/.

Temporal characteristics

OK, then how about duration? Let’s visualise it first, and then check how it correlates with each of the spectral parameters. We’ll use the original df_mid data frame again, but convert the duration from seconds into milliseconds.

Violin plot

# convert duration (s) into (ms)
df_mid <- df_mid |> 
  dplyr::mutate(
    duration_ms = duration * 1000,
    duration_ms_z = scale(duration_ms)
  )

# plot
df_mid |> 
  ggplot(aes(x = language, y = duration_ms_z)) +
  geom_jitter(aes(colour = language), alpha = 0.5) +
  geom_violin(alpha = 0.4) +
  geom_boxplot(width = 0.3, alpha = 0.4) +
  facet_grid(~ segment) +
  scale_colour_manual(values = cbPalette) +
  theme(
    strip.text.y = element_text(angle = 360)
  )

Correlation

Let’s also check whether duration correlates with any of the spectral measures.

# with F1 
df_mid |> 
  ggplot(aes(x = f1z, y = duration_ms_z)) +
  geom_point(aes(colour = language), alpha = 0.4) +
  scale_colour_manual(values = cbPalette) +
  facet_wrap(segment ~ language)

# with F2
df_mid |> 
  ggplot(aes(x = f2z, y = duration_ms_z)) +
  geom_point(aes(colour = language), alpha = 0.4) +
  scale_colour_manual(values = cbPalette) +
  facet_wrap(segment ~ language)

# with F3
df_mid |> 
  ggplot(aes(x = f3z, y = duration_ms_z)) +
  geom_point(aes(colour = language), alpha = 0.4) +
  scale_colour_manual(values = cbPalette) +
  facet_wrap(segment ~ language)

Mmm, we do not seem to find particularly interesting correlations here. But that is fine for now, as we are just exploring. The key takeaway is that we always need multiple plots to obtain a holistic understanding of the data structure, which is somewhat cumbersome. This is where PCA comes in handy: it is a dimension-reduction technique that boils the variance in the data down into a handful of principal components.
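To see what dimension reduction means in practice, here is a minimal sketch on simulated data (the variables x and y below are made up purely for illustration and have nothing to do with the liquid data set): when two variables are strongly correlated, the first PC absorbs most of their shared variance.

```r
# a toy illustration of dimension reduction (simulated data)
set.seed(1)
x <- rnorm(200)                # first dimension
y <- x + rnorm(200, sd = 0.3)  # second dimension, strongly correlated with x

toy_pca <- prcomp(cbind(x, y))

# PC1 should account for the bulk of the variance
summary(toy_pca)
```

On this toy data, PC1 explains well over 90% of the variance, so the two correlated dimensions can be summarised by a single PC with little loss of information.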

Applying PCA

Let’s apply PCA to the data. We’ll classify the tokens based on the z-scored F1, F2, F3 and duration values. We use the prcomp() function to perform PCA.

# rearrange column order for easier data subsetting
df_mid <- df_mid |> 
  dplyr::relocate(f1z, f2z, f3z, duration_ms_z)

# check colnames
colnames(df_mid)
##  [1] "f1z"            "f2z"            "f3z"            "duration_ms_z" 
##  [5] "...1"           "file"           "speaker"        "language"      
##  [9] "duration"       "segment"        "word"           "f1"            
## [13] "f2"             "f3"             "previous_sound" "next_sound"    
## [17] "percent"        "gender"         "position"       "vowel"         
## [21] "duration_ms"
# PCA on F1z, F2z, F3z and duration
pca_mid <- prcomp(select(df_mid, 1:4))

# check summary
summary(pca_mid)
## Importance of components:
##                           PC1    PC2    PC3    PC4
## Standard deviation     1.1365 1.0302 0.9856 0.7863
## Proportion of Variance 0.3276 0.2692 0.2464 0.1568
## Cumulative Proportion  0.3276 0.5968 0.8432 1.0000

The summary shows three rows:

  • Standard deviation: This shows the standard deviation of the data along each PC (i.e., the square root of its eigenvalue). The greater the value, the more variance is captured by that PC.

  • Proportion of Variance: This expresses the amount of variance explained by each PC in a proportional manner. It can be calculated with the following formula:

\[ pca\_var\_exp_i = \frac{\text{sdev}_i^2}{\sum_{j} \text{sdev}_j^2} \]

  • Cumulative Proportion: This simply shows the running sum of the proportion of variance explained by each PC. For instance, PC1 explains 32.76% of variance whereas PC2 explains 26.92%. The cumulative proportion of PC1 and PC2 is thus 59.68% (i.e., 32.76% + 26.92%).
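The cumulative proportion can be reproduced directly from the standard deviations stored in pca_mid – a small sketch that simply chains the formula above with cumsum():

```r
# proportion of variance explained by each PC
prop_var <- pca_mid$sdev^2 / sum(pca_mid$sdev^2)

# running total across PCs -- this matches the
# "Cumulative Proportion" row of summary(pca_mid)
cumsum(prop_var)
```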

Another important index in PCA is the eigenvalue. Eigenvalues represent how much variance is explained by each PC. This sounds similar to some of the values we have already seen: standard deviation and (cumulative) proportion. Indeed, the eigenvalue of each PC is simply its standard deviation squared, and the proportion of variance is then calculated from the eigenvalues.

# calculate eigenvalues
eigenvalues <- pca_mid$sdev^2

# show eigenvalues
eigenvalues
## [1] 1.2916551 1.0612984 0.9714709 0.6183089
# calculating proportion of variance 
pca_var_exp <- eigenvalues / sum(eigenvalues)  # Proportion of variance

pca_var_exp
## [1] 0.3276040 0.2691783 0.2463953 0.1568224

PCA summary

Let’s also take a look inside pca_mid.

# show the PCA results
pca_mid
## Standard deviations (1, .., p=4):
## [1] 1.1365100 1.0301934 0.9856322 0.7863262
## 
## Rotation (n x k) = (4 x 4):
##                      PC1         PC2        PC3        PC4
## f1z           0.06228593  0.54478246  0.8043986  0.2286382
## f2z           0.64132789  0.26005576 -0.3971193  0.6027983
## f3z           0.72630007 -0.01361938  0.1439881 -0.6719897
## duration_ms_z 0.23938924 -0.79711830  0.4177398  0.3644018

The top line shows the standard deviations, which we just saw earlier, so let’s skip this for now.

At the bottom, we have the loadings matrix. This shows the eigenvectors: i.e., (1) the amount and (2) the direction of the contribution that each original data dimension makes to each PC. Let’s disentangle this one by one:

  • PC1: Contributions of f1z, f2z, f3z and duration are all in the same direction (i.e., positive). But the amount of contribution is fairly small for f1z and duration compared to f2z and f3z.

  • PC2: In contrast to PC1, f1z contributes to PC2 to a greater extent than f2z and f3z. duration also seems to make a big contribution here but in the opposite direction (i.e., negative).

  • PC3: f1z makes the largest contribution here, and the contribution of duration is quite big, too.

  • PC4: f2z and f3z both seem to contribute to this dimension, although in opposite directions.
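To see more concretely how the loadings link the original variables to the PCs, note that the PC scores are simply the centred data multiplied by the loadings matrix. A quick sketch to verify this (assuming df_mid still has the four z-scored variables in columns 1 to 4, as arranged above):

```r
# centre the four input variables (prcomp() centres by default)
X <- scale(as.matrix(dplyr::select(df_mid, 1:4)), center = TRUE, scale = FALSE)

# PC scores = centred data %*% loadings
manual_scores <- X %*% pca_mid$rotation

# this should match the scores that prcomp() stores in pca_mid$x
all.equal(unname(manual_scores), unname(pca_mid$x))
```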

So, all in all, PCA might suggest:

  • PC1 captures a covariation between f2z and f3z, which may correspond to the spectral difference between /l/ and /ɹ/.

  • PC2 mainly captures a covariation of duration and f1z but in opposite directions.

  • PC3 shows variation in f1z.

  • PC4 again seems to capture f2z and f3z but in opposite directions.

The interpretation may be further facilitated by data visualisation. Let’s try two main approaches to visualising PCA results.

Data visualisation 1: Scree plot

The amount of variance explained is useful when determining which PCs to retain for analysis. As a rule of thumb, Baayen (2008) recommends that we retain PCs that explain more than 5% of the variance in the data.

A useful visualisation of the proportion of variance is a scree plot. The default function screeplot() lets you create one quite easily:

# scree plot: bar chart
screeplot(pca_mid)

# scree plot: line plot
screeplot(pca_mid, type = "line")

You can also create a scree plot using ggplot() by manually calculating the proportion of variance from the standard deviations (see above for the formula). This is useful when you need more flexibility – e.g., to show the 5% threshold suggested by Baayen (2008).

## obtaining proportion of variance 
pca_var_exp <- pca_mid$sdev^2 / sum(pca_mid$sdev^2)

# making pca_var_exp a tibble and adding a PC index column
pca_var_exp <- pca_var_exp |>  
  tibble::as_tibble() |>  
  dplyr::mutate(
    PC = row_number()
  )

# create a plot
scree_plot <- pca_var_exp |> 
  ggplot(aes(x = PC, y = value)) +
  geom_line() +
  geom_text(aes(label = round(value, digits = 3)*100), nudge_x = 0.2) +
  geom_label(aes(label = PC), label.padding = unit(0.40, "lines")) +
  geom_hline(yintercept = 0.05, linetype = 'dotted') +
  scale_x_continuous(breaks = 1:4) +
  labs(x = "Principal Component", y = "Variance Explained", title = "Proportion of Variance explained by each PC")

# showing plot
scree_plot

Since only four PCs are identified from the data, the scree plot doesn’t add that much here. But it will be useful when prcomp() identifies more than a few PCs.

Data visualisation 2: Biplot

The interpretation of the loadings matrix was more or less straightforward in this case, given that only four parameters were involved. It is easy to see, however, that this can get quite complicated when the original data have a greater number of dimensions.

Biplots help us better understand the loadings and the PC scores for each observation. For example, the default function biplot() plots all observations along the PC1 and PC2 dimensions, with the eigenvectors shown as red arrows (i.e., the direction and the amount of weighting: also called loadings).

# default function for biplot
biplot(pca_mid)

Detailed biplots

Similarly to the scree plot, we could also create biplots manually using ggplot(). This will help us better understand what each PC could stand for in the current data set.

# Extract PCA scores (principal component scores for each observation)
pca_scores <- as.data.frame(pca_mid$x)

# Extract PCA loadings (the contribution of each original variable to the PCs)
pca_loadings <- as.data.frame(pca_mid$rotation)

# Combine scores into one data frame for plotting
scores_df <- cbind(pca_scores, df_mid[, c("segment", "language", "vowel")])

# Reshape the loadings for visualization
loadings_df <- pca_loadings |> 
  mutate(variable = rownames(pca_loadings))

# Plotting the PCA biplot with facets by language
ggplot() +
  geom_point(data = scores_df, aes(x = PC1, y = PC2, color = vowel, shape = vowel), size = 2, alpha = 0.5) + # Plot the PCA scores (observations)
  geom_segment(data = loadings_df, aes(x = 0, y = 0, xend = PC1 * 5, yend = PC2 * 5), arrow = arrow(type = "closed", length = unit(0.1, "inches")), color = "red") + # Plot the loadings (variables)
  geom_text(data = loadings_df, aes(x = PC1 * 5, y = PC2 * 5, label = variable), size = 4, color = "red") + # Add labels for the loadings
  stat_ellipse(data = scores_df, aes(x = PC1, y = PC2, color = vowel, fill = vowel), geom = "polygon", alpha = 0.2, level = 0.95) + 
  facet_grid(segment ~ language) +
  labs(title = "PCA biplot by language") +
  scale_color_manual(values = cbPalette) + 
  scale_fill_manual(values = cbPalette) + 
  theme(legend.position = "bottom",
        strip.text.y = element_text(angle = 360))

Using PC scores as data

So far, we have identified four PCs, each of which explains more than 5% of variance in the data. At first glance, this doesn’t seem to be a good way of data reduction, as we got four PC dimensions out of four variables (f1z, f2z, f3z, and duration).

However, we could also argue that data reduction has indeed been achieved, in the sense that some PCs capture joint contributions of multiple parameters – e.g., PC1 showing covariation of f2z and f3z. In other words, we have identified in a data-driven manner that the two spectral parameters act together, and we have compressed them into one PC dimension (i.e., PC1).

One of the advantages of PCA is that it converts the data onto new scales in a data-driven manner. More specifically, PCA associates each data point with new numeric values called PC scores, which show where each data point lies along each PC dimension. As a common practice, we can use PC scores as input to further data visualisation and statistical analysis.

Extracting PC scores

PC scores are stored as x in the PCA output. Let’s use the head() function to display only the first six rows, as otherwise the output would be too long.

# displaying PC scores
head(pca_mid$x)
##               PC1        PC2        PC3        PC4
## [1,] -0.008988979 -0.3169801 1.77770054 -0.5644188
## [2,]  0.306799132 -0.2514384 1.55617832 -0.5399825
## [3,] -0.102459933  0.7019741 1.24214125 -0.9686917
## [4,]  0.210710158 -0.1331311 0.02665082 -1.1115056
## [5,]  0.125017636 -0.5062928 1.38864989 -1.1168562
## [6,]  0.222555474 -0.5372844 1.96536854 -0.4210074

Assuming that we haven’t made any changes to the order of rows, we can combine the PC scores with the existing data set. We have actually already done this when visualising the data, so I’ll simply copy and paste the code here:

# extract PCA scores (principal component scores for each observation)
pca_scores <- as.data.frame(pca_mid$x)

# check the number of rows
## pc score
nrow(pca_scores)
## [1] 2306
## main data set
nrow(df_mid)
## [1] 2306

Both pca_scores and df_mid contain the same number of rows, so we can merge them into one data frame.

# combine scores into one data frame for plotting
scores_df <- cbind(pca_scores, df_mid[, c("segment", "language", "vowel", "speaker", "gender", "word")])

Check PC scores

Before visualisation, let’s take a look at descriptive statistics of PC scores.

scores_df |> 
  dplyr::group_by(language, segment, vowel) |> 
  dplyr::summarise(
    PC1_mean = mean(PC1),
    PC1_sd = sd(PC1),
    PC2_mean = mean(PC2),
    PC2_sd = sd(PC2),
    PC3_mean = mean(PC3),
    PC3_sd = sd(PC3),
    PC4_mean = mean(PC4),
    PC4_sd = sd(PC4)
  ) |> 
  dplyr::ungroup()
## `summarise()` has grouped output by 'language', 'segment'. You can override
## using the `.groups` argument.
## # A tibble: 12 × 11
##    language segment vowel PC1_mean PC1_sd PC2_mean PC2_sd PC3_mean PC3_sd
##    <chr>    <chr>   <chr>    <dbl>  <dbl>    <dbl>  <dbl>    <dbl>  <dbl>
##  1 English  L       /i/      0.984  0.872 -0.360    0.984   -0.186  1.01 
##  2 English  L       /u/      0.929  0.840 -0.782    1.08    -0.297  0.785
##  3 English  L       /æ/      0.806  0.632  0.00223  1.06     0.980  0.859
##  4 English  R       /i/     -0.539  0.770 -0.167    1.11    -0.380  0.748
##  5 English  R       /u/     -0.765  0.679 -0.285    1.07    -0.284  0.612
##  6 English  R       /æ/     -0.948  0.836  0.285    1.24     0.439  0.782
##  7 Japanese L       /i/      0.870  0.765  0.00313  0.715   -0.561  0.950
##  8 Japanese L       /u/      0.179  0.898 -0.464    0.896   -0.632  0.848
##  9 Japanese L       /æ/      0.525  0.856  0.311    0.936    0.591  0.984
## 10 Japanese R       /i/     -0.141  1.03   0.242    0.823   -0.405  0.783
## 11 Japanese R       /u/     -0.969  0.910 -0.178    0.800   -0.453  0.702
## 12 Japanese R       /æ/     -0.742  1.08   0.399    1.02     0.426  0.792
## # ℹ 2 more variables: PC4_mean <dbl>, PC4_sd <dbl>

We get lots of numbers here – please do feel free to take a moment to digest them. But we can also visualise them to facilitate the interpretation.

Visualising PCs

PC1

Let’s visualise the PC scores to see if our assumptions about each PC are correct. We think that PC1 captures the covariation of f2z and f3z, and this should correspond to the contrast between English /l/ (higher F2 and F3) and /ɹ/ (lower F2 and F3).

## PC1 - by liquid consonant
scores_df |> 
  ggplot(aes(x = segment, y = PC1, colour = segment)) +
  geom_jitter(width = 0.3) +
  geom_violin(alpha = 0.3) +
  geom_boxplot(width = 0.3, colour = "black") +
  geom_hline(yintercept = 0, linetype = "dashed", linewidth = 0.3) +
  scale_colour_manual(values = cbPalette) +
  facet_grid(language ~ vowel) +
  theme(
    strip.text.y = element_text(angle = 360)
  )

This is more or less true, good! We could also make a between-group comparison.

## PC1 - by group
scores_df |> 
  ggplot(aes(x = language, y = PC1, colour = language)) +
  geom_jitter(width = 0.3) +
  geom_violin(alpha = 0.3) +
  geom_boxplot(width = 0.3, colour = "black") +
  geom_hline(yintercept = 0, linetype = "dashed", linewidth = 0.3) +
  scale_colour_manual(values = cbPalette) +
  facet_grid(segment ~ vowel) +
  theme(
    strip.text.y = element_text(angle = 360)
  )

We can see some patterns here: e.g., while L1 English and L1 Japanese speakers are broadly similar for /l/, L1 Japanese speakers produce smaller PC1 scores in the /u/ context than L1 English speakers. Some between-group differences can be found for English /ɹ/ as well.

Your turn

In a similar manner to PC1, please visualise PC2, PC3 and PC4. Please try different visualisation methods to explore what each PC dimension highlights.

PC2

PC2 mainly captured covariation between duration and f1z. This is a little less intuitive, but previous research has indeed shown that L1 English and L1 Japanese speakers may exhibit different F1 transition duration profiles. So, it would be interesting to visualise PC2 to show between-group differences:

## PC2
### scores_df |> ...

PC3

PC3 predominantly captures variation in f1z. So I would assume that it reflects the vocalic context – e.g., higher F1 for low vowels.

## PC3
### scores_df |> ...

PC4

PC4 is similar to PC1 in that it captures variation of f2z and f3z, but here the direction is opposite. I would still assume that this may result from differences in vocalic context:

## PC4
### scores_df |> ...

Statistical analysis

As I said earlier, PC scores can, as a somewhat common practice, serve as input to further statistical analysis. So far, we have found that PC1 captures the contrast between English /l/ and /ɹ/; although this is sort of obvious, the most important message here is that the data tell us this without any a priori knowledge.

In the data visualisation above, it was somewhat difficult to identify which of the variables (language, vowel and segment) would have a significant effect. Let’s construct a linear mixed-effect model to formally test this.

Your turn

Using knowledge from yesterday, please construct a linear mixed-effect model to predict z-normalised PC1 values by:

  • language

  • segment

  • an interaction between language and segment

Hint: My hunch here is that a by-speaker random effect would lead to a singular-fit warning, meaning that it causes issues for the model estimation. One possible account might be that PC1z may not show much speaker-specific variation, because the normalised PC values are based on spectral and duration values that had already been within-speaker normalised.
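If you want to check this yourself, lme4 provides isSingular() for diagnosing singular fits. A small sketch (the model formula below is just one candidate specification, not necessarily the model you should end up with):

```r
# fit a candidate model with a by-speaker random intercept
m_check <- lme4::lmer(
  PC1 ~ language * segment + (1 | speaker),
  data = scores_df, REML = FALSE
)

# TRUE indicates a (near-)singular fit, i.e. a random-effect
# variance estimated at (or very close to) zero
lme4::isSingular(m_check)
```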

# converting variables into factor
scores_df <- scores_df |> 
  dplyr::mutate(
    language = as.factor(language),
    vowel = as.factor(vowel),
    segment = as.factor(segment),
    speaker = as.factor(speaker),
    word = as.factor(word)
  )

# run a full model
m1 <- lme4::lmer(PC1 ~ language + segment + vowel + language:segment + language:vowel + (1|speaker) + (1|word), data = scores_df, REML = FALSE)

# model summary
summary(m1)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: 
## PC1 ~ language + segment + vowel + language:segment + language:vowel +  
##     (1 | speaker) + (1 | word)
##    Data: scores_df
## 
##      AIC      BIC   logLik deviance df.resid 
##   5929.4   5992.5  -2953.7   5907.4     2295 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -4.1466 -0.6199 -0.0194  0.5930  4.9442 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  speaker  (Intercept) 0.009092 0.09535 
##  word     (Intercept) 0.005558 0.07455 
##  Residual             0.748043 0.86490 
## Number of obs: 2306, groups:  speaker, 45; word, 16
## 
## Fixed effects:
##                           Estimate Std. Error t value
## (Intercept)                0.74967    0.06889  10.882
## languageJapanese          -0.28313    0.07487  -3.782
## segmentR                  -1.65190    0.06750 -24.471
## vowel/i/                   0.29449    0.07842   3.756
## vowel/u/                   0.15406    0.08607   1.790
## languageJapanese:segmentR  0.49938    0.07353   6.792
## languageJapanese:vowel/i/  0.18394    0.08524   2.158
## languageJapanese:vowel/u/ -0.43883    0.09347  -4.695
## 
## Correlation of Fixed Effects:
##                           (Intr) lnggJp sgmntR vowel/i/ vowel/u/ lngJ:R
## languagJpns               -0.673                                       
## segmentR                  -0.476  0.300                                
## vowel/i/                  -0.537  0.336 -0.042                         
## vowel/u/                  -0.496  0.312 -0.024  0.447                  
## lnggJpns:sR                0.299 -0.475 -0.638  0.039    0.022         
## languageJapanese:vowel/i/  0.336 -0.499  0.039 -0.642   -0.285   -0.057
## languageJapanese:vowel/u/  0.313 -0.463  0.022 -0.285   -0.633   -0.035
##                           languageJapanese:vowel/i/
## languagJpns                                        
## segmentR                                           
## vowel/i/                                           
## vowel/u/                                           
## lnggJpns:sR                                        
## languageJapanese:vowel/i/                          
## languageJapanese:vowel/u/  0.425
# nested model 1 -- testing the language-segment interaction
m2 <- lme4::lmer(PC1 ~ language + segment + vowel + language:vowel + (1|speaker) + (1|word), data = scores_df, REML = FALSE)

# model comparison
anova(m1, m2, test = "Chisq")
## Data: scores_df
## Models:
## m2: PC1 ~ language + segment + vowel + language:vowel + (1 | speaker) + (1 | word)
## m1: PC1 ~ language + segment + vowel + language:segment + language:vowel + (1 | speaker) + (1 | word)
##    npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)    
## m2   10 5973.0 6030.4 -2976.5   5953.0                         
## m1   11 5929.4 5992.5 -2953.7   5907.4 45.612  1  1.441e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# nested model 2 -- testing the language-vowel interaction
m3 <- lme4::lmer(PC1 ~ language + segment + vowel + language:segment + (1|speaker) + (1|word), data = scores_df, REML = FALSE)

# model comparison
anova(m1, m3, test = "Chisq")
## Data: scores_df
## Models:
## m3: PC1 ~ language + segment + vowel + language:segment + (1 | speaker) + (1 | word)
## m1: PC1 ~ language + segment + vowel + language:segment + language:vowel + (1 | speaker) + (1 | word)
##    npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)    
## m3    9 5968.0 6019.7 -2975.0   5950.0                         
## m1   11 5929.4 5992.5 -2953.7   5907.4 42.679  2  5.399e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# post-hoc analysis
## between-group effects of each language
emmeans::emmeans(m1, pairwise ~ language | segment:vowel, adjust = "tukey")
## $emmeans
## segment = L, vowel = /æ/:
##  language emmean     SE   df lower.CL upper.CL
##  English   0.750 0.0746 63.5   0.6007   0.8986
##  Japanese  0.467 0.0647 37.0   0.3354   0.5977
## 
## segment = R, vowel = /æ/:
##  language emmean     SE   df lower.CL upper.CL
##  English  -0.902 0.0754 66.0  -1.0528  -0.7516
##  Japanese -0.686 0.0651 37.4  -0.8178  -0.5542
## 
## segment = L, vowel = /i/:
##  language emmean     SE   df lower.CL upper.CL
##  English   1.044 0.0769 68.1   0.8907   1.1976
##  Japanese  0.945 0.0689 45.6   0.8062   1.0838
## 
## segment = R, vowel = /i/:
##  language emmean     SE   df lower.CL upper.CL
##  English  -0.608 0.0748 62.5  -0.7572  -0.4583
##  Japanese -0.208 0.0672 42.5  -0.3432  -0.0719
## 
## segment = L, vowel = /u/:
##  language emmean     SE   df lower.CL upper.CL
##  English   0.904 0.0858 61.9   0.7323   1.0752
##  Japanese  0.182 0.0782 42.6   0.0239   0.3396
## 
## segment = R, vowel = /u/:
##  language emmean     SE   df lower.CL upper.CL
##  English  -0.748 0.0849 60.1  -0.9180  -0.5783
##  Japanese -0.971 0.0772 40.9  -1.1267  -0.8148
## 
## Degrees-of-freedom method: kenward-roger 
## Confidence level used: 0.95 
## 
## $contrasts
## segment = L, vowel = /æ/:
##  contrast           estimate     SE  df t.ratio p.value
##  English - Japanese   0.2831 0.0756 216   3.745  0.0002
## 
## segment = R, vowel = /æ/:
##  contrast           estimate     SE  df t.ratio p.value
##  English - Japanese  -0.2162 0.0768 230  -2.816  0.0053
## 
## segment = L, vowel = /i/:
##  contrast           estimate     SE  df t.ratio p.value
##  English - Japanese   0.0992 0.0814 268   1.218  0.2242
## 
## segment = R, vowel = /i/:
##  contrast           estimate     SE  df t.ratio p.value
##  English - Japanese  -0.4002 0.0780 237  -5.129  <.0001
## 
## segment = L, vowel = /u/:
##  contrast           estimate     SE  df t.ratio p.value
##  English - Japanese   0.7220 0.0893 392   8.083  <.0001
## 
## segment = R, vowel = /u/:
##  contrast           estimate     SE  df t.ratio p.value
##  English - Japanese   0.2226 0.0876 371   2.541  0.0115
## 
## Degrees-of-freedom method: kenward-roger

Summary

I hope I have shown that PCA is a powerful tool for identifying salient dimensions in the data in a bottom-up, data-driven manner. It projects the data onto new dimensions defined by the identified principal components. Furthermore, we can obtain PC scores for each observation along the PC dimensions and use them as input to further statistical analysis. This flexibility in the choice of statistical modelling is an advantage of PCA.

Summarising what we have found so far:

  • PCA here identifies four principal components (PCs) that explain 32.76% (PC1), 26.92% (PC2), 24.64% (PC3) and 15.68% (PC4) of the variance in the data.

  • The largest proportion of variance is explained by PC1, which captures a covariation of F2 and F3, as suggested by the biplot. The data visualisation clearly shows that English /l/ and /ɹ/ can be distinguished along PC1 (English /l/: higher PC1 values; English /ɹ/: lower PC1 values).

  • We can also use PC scores as data for further statistical analysis. Analysis with linear mixed-effect models shows that:

    • the interactions language:vowel and language:segment are both statistically significant

    • L1 Japanese and L1 English speakers differ in PC1 values at a statistically significant level in all conditions except for English /l/ in the /i/ context.

Functional Principal Component Analysis (FPCA)

In the previous section, we performed principal component analysis (PCA) on the static data – i.e., the midpoint measurements of formant frequencies in the production of English /l/ and /ɹ/ by L1 Japanese and L1 English speakers. We found that F2 and F3 seem to be key to understanding their production, judging from their contributions to PC1.

While we have identified which parameters characterise the between-group difference, this does not tell us much about how the two speaker groups differ. That is, we now know that F2 and F3 are important for understanding English /l/ and /ɹ/ production, but our ultimate aim is to know how L1 Japanese speakers differ from L1 English speakers in producing these sounds. Let’s turn to dynamic analysis and explore how the two groups of speakers differ in their realisations of F2 and F3 over time.

We will use Functional Principal Component Analysis (FPCA) to explore salient dynamic properties in the data. The basic architecture is quite similar to the ordinary PCA that we have just seen with a couple of key differences as it deals with functional data.
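As a preview of the workflow, fdapace expects each trajectory as a list entry: a vector of measurements (Ly) paired with a vector of time points (Lt), which can be built from long-format data with MakeFPCAInputs(). Here is a minimal sketch on simulated curves (the data below are made up purely for illustration; we will apply the same steps to the formant trajectories later):

```r
library(fdapace)

# simulate 30 curves, each sampled at 11 normalised time points
set.seed(3)
tt  <- seq(0, 1, length.out = 11)
ids <- rep(1:30, each = length(tt))
yy  <- as.vector(sapply(1:30, function(i) sin(2 * pi * tt) + rnorm(11, sd = 0.1)))

# convert the long-format data into the list format fdapace expects
inputs <- MakeFPCAInputs(IDs = ids, tVec = rep(tt, 30), yVec = yy)

# run FPCA
fpca_toy <- FPCA(inputs$Ly, inputs$Lt)

# cumulative fraction of variance explained by the functional PCs
fpca_toy$cumFVE
```

The output parallels what we saw with prcomp(): fpca_toy$lambda holds the eigenvalues and fpca_toy$phi the eigenfunctions, the functional analogue of the loadings.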

Preliminaries

Installing/loading packages

For the dynamic analysis using FPCA, we are using the fdapace package, so let’s install it here.

# installing packages
# install.packages("tidyverse")
# install.packages("fdapace")

# importing packages
library(tidyverse)
library(fdapace)

Importing data set

Let’s import the data set. We are using the data set openly available on the Open Science Framework (OSF) repository.

# import the csv file "initial.liquid.dynamic.csv" from the "data" directory and save it as "df_dyn"
df_dyn <- readr::read_csv("data/initial.liquid.dynamic.csv")

Data wrangling

Check data

We always start by inspecting the data set, using colnames().

# Let's check what columns the data frame contains
colnames(df_dyn)
##  [1] "...1"           "file"           "speaker"        "language"      
##  [5] "f0"             "duration"       "segment"        "previous_sound"
##  [9] "next_sound"     "word"           "f1"             "f2"            
## [13] "f3"             "time"           "IsApproximant"  "IsAcoustic"    
## [17] "note"           "gender"         "omit"           "Barkf1"        
## [21] "Barkf2"         "Barkf3"         "f2f1"           "f3f2"          
## [25] "Barkf2f1"       "Barkf3f2"       "position"       "context"       
## [29] "liquid"         "input_file"

Omitting irrelevant columns

We’ll omit the columns we don’t need.

# Let's check the levels of "IsApproximant" - only approximant tokens should remain
df_dyn |> 
  dplyr::group_by(IsApproximant) |> 
  dplyr::summarise() |> 
  dplyr::ungroup()
## # A tibble: 1 × 1
##   IsApproximant
##   <chr>        
## 1 yes
# Let's check the levels of "IsAcoustic" - only tokens of good recording quality should remain
df_dyn |> 
  dplyr::group_by(IsAcoustic) |> 
  dplyr::summarise() |> 
  dplyr::ungroup()
## # A tibble: 1 × 1
##   IsAcoustic
##   <chr>     
## 1 yes
# How about 'omit'?
df_dyn |> 
  dplyr::group_by(omit) |> 
  dplyr::summarise() |> 
  dplyr::ungroup()
## # A tibble: 1 × 1
##    omit
##   <dbl>
## 1     0
# Remove columns that we no longer need
df_dyn <- df_dyn |> 
  dplyr::select(-c(IsApproximant, IsAcoustic, omit, Barkf1, Barkf2, Barkf3, Barkf2f1, Barkf3f2, f2f1, f3f2))

Let’s check the column names again.

colnames(df_dyn)
##  [1] "...1"           "file"           "speaker"        "language"      
##  [5] "f0"             "duration"       "segment"        "previous_sound"
##  [9] "next_sound"     "word"           "f1"             "f2"            
## [13] "f3"             "time"           "note"           "gender"        
## [17] "position"       "context"        "liquid"         "input_file"

Let’s also convert the context column into IPA symbols for a more intuitive representation:

# convert the ARPABET notation into IPA symbols
df_dyn <- df_dyn |> 
  dplyr::mutate(
    context = case_when(
      context == "AE" ~ "/æ/",
      context == "IY" ~ "/i/",
      context == "UW" ~ "/u/"
    )
  )

Checking the number of participants, tokens…

Let’s also obtain some descriptive statistics here. Note that we need to divide the number of rows by 11 to obtain the accurate number of tokens, as one token now has 11 time points.

# number of participants
df_dyn |> 
  dplyr::group_by(language) |> 
  dplyr::summarise(n = n_distinct(speaker)) |> 
  dplyr::ungroup()
## # A tibble: 2 × 2
##   language     n
##   <chr>    <int>
## 1 English     14
## 2 Japanese    31
# number of tokens per segment
df_dyn |> 
  dplyr::group_by(segment) |> 
  dplyr::summarise(n = n()/11) |> # divide by 11 time points
  dplyr::ungroup()
## # A tibble: 6 × 2
##   segment     n
##   <chr>   <dbl>
## 1 LAE1      511
## 2 LIY1      408
## 3 LUW1      299
## 4 RAE1      502
## 5 RIY1      481
## 6 RUW1      314
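If you’d rather not hard-code the number of time points, an equivalent count uses the number of distinct token IDs, assuming each value of file identifies exactly one token (as it does in this data set). A toy sketch with made-up data:

```r
library(dplyr)

# toy long-format data (not the workshop data): 3 tokens x 3 time points each
toy <- data.frame(
  file    = rep(c("tok1", "tok2", "tok3"), each = 3),
  segment = rep(c("LAE1", "LAE1", "RAE1"), each = 3),
  time    = rep(c(0, 50, 100), times = 3)
)

# count distinct token IDs per segment instead of dividing the row count by 11
toy |>
  group_by(segment) |>
  summarise(n = n_distinct(file)) |>
  ungroup()
# LAE1 has 2 tokens, RAE1 has 1
```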

Data visualisation

Scaling formant frequencies

Do you remember how to visualise the dynamic data? The basic procedure is the same as in the static analysis: we first apply z-score normalisation to the formant frequencies to make sure that formant values are comparable across speakers.

df_dyn <- df_dyn |> 
  dplyr::group_by(speaker) |> # tell R to do the following iteration per speaker
  dplyr::mutate(
    f1z = as.numeric(scale(f1)), # scale f1 into z-score
    f2z = as.numeric(scale(f2)), # scale f2 into z-score
    f3z = as.numeric(scale(f3)) # scale f3 into z-score
  ) |> 
  dplyr::ungroup() # don't forget ungrouping

Descriptive statistics

Let’s check the mean and SD for both raw and normalised formant values: just see F1 for now. Note that the mean z-scores do not seem to look zero, but this is because computers are not very good at dealing with very small numbers (e.g., decimals) and some fluctuations occur in computing the values.

# check mean and sd of raw/scaled F1 values for each speaker
df_dyn |> 
  dplyr::group_by(speaker) |>
  dplyr::summarise(
    f1_mean = mean(f1),
    f1_sd = sd(f1),
    f1z_mean = mean(f1z),
    f1z_sd = sd(f1z)
  ) |> 
  dplyr::ungroup() 
## # A tibble: 45 × 5
##    speaker f1_mean f1_sd  f1z_mean f1z_sd
##    <chr>     <dbl> <dbl>     <dbl>  <dbl>
##  1 2d57ke     468.  140. -9.05e-17   1   
##  2 2drb3c     575.  243.  2.39e-16   1   
##  3 2zy9tf     459.  228. -2.41e-16   1   
##  4 3bcpyh     438.  142. -5.72e-18   1.00
##  5 3hsubn     537.  177.  6.11e-17   1   
##  6 3pzrts     458.  195. -1.44e-18   1.00
##  7 4ps8zx     467.  172.  2.19e-16   1   
##  8 54i2ks     453.  192.  1.43e-16   1.00
##  9 5jzj2h     412.  133.  2.49e-16   1.00
## 10 5upwe3     444.  189. -6.51e-17   1.00
## # ℹ 35 more rows
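A toy illustration of the point above (made-up numbers, not the workshop data): scale() computes (x - mean(x)) / sd(x), so the resulting z-scores have mean 0 and SD 1 only up to floating-point precision.

```r
x <- c(2.3, 4.1, 6.8, 3.5)
z <- as.numeric(scale(x))      # same as (x - mean(x)) / sd(x)
mean(z)                        # effectively zero (may print as e.g. 1e-17)
isTRUE(all.equal(mean(z), 0))  # TRUE: equal within numerical tolerance
sd(z)                          # 1 (up to rounding)
```

all.equal() is the usual way to compare floating-point values in R; an exact test such as mean(z) == 0 can fail even though the value is zero for all practical purposes.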

Visualisation

raw trajectories

Let’s visualise the raw data first:

# F2 - raw trajectories
df_dyn |> 
  ggplot(aes(x = time, y = f2z)) +
  geom_point(aes(colour = language, group = file), size = 0.3, alpha = 0.4) + # points use size, not width
  geom_path(aes(colour = language, group = file), linewidth = 0.3, alpha = 0.4) + # lines use linewidth
  geom_hline(yintercept = 0, linetype = "dashed", linewidth = 0.5) +
  scale_colour_manual(values = cbPalette) + 
  facet_grid(liquid ~ context) +
  labs(x = "time", y = "F2 (z-normalised)", title = "time-varying change in F2 frequency") +
  theme(strip.text.y = element_text(angle = 0))

smooths

Let’s also try plotting smoothed trajectories:

# F2 - smooths
df_dyn |> 
  ggplot(aes(x = time, y = f2z)) +
  # geom_point(aes(colour = language, group = file), size = 0.3, alpha = 0.1) +
  # geom_path(aes(colour = language, group = file), linewidth = 0.3, alpha = 0.1) +
  geom_smooth(aes(colour = language, group = language), linewidth = 1.2, se = TRUE) +
  geom_hline(yintercept = 0, linetype = "dashed", linewidth = 0.5) +
  scale_colour_manual(values = cbPalette) + 
  facet_grid(liquid ~ context) +
  labs(x = "time", y = "F2 (z-normalised)", title = "smoothed time-varying change in F2 frequency") +
  theme(strip.text.y = element_text(angle = 0))
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

Functional Principal Component Analysis (FPCA)

Previously, we investigated between-group differences in F2 dynamics by fitting Generalised Additive Mixed-Effect Models (GAMMs). This was possible because we had some idea of the relevant factors (e.g., vowel context, speaker group). In a way, many of the decisions were made in a top-down manner.

We could also try another approach: a bottom-up one. That is, do the data tell us anything about higher-order groupings? Do the vowel context and speaker group differences emerge from the data themselves?

Here, let’s try Functional Principal Component Analysis (FPCA) as a bottom-up approach.

# IDs = token column; tVec = time column; yVec = variable column(s)
input_df <- fdapace::MakeFPCAInputs(IDs = df_dyn$file, tVec = df_dyn$time, yVec = df_dyn$f2z)

# Check if there's any issues with the data
fdapace::CheckData(input_df$Ly, input_df$Lt)

# No errors have been returned, so let's now run FPCA on the F2 trajectories
df_dyn_fpca <- fdapace::FPCA(Ly = input_df$Ly, Lt = input_df$Lt, optns = list(plot = TRUE))

Checking FPCA results

Let’s look into the results here. summary(df_dyn_fpca) tells you what attributes are stored in the FPCA data:

summary(df_dyn_fpca)
##                Length Class    Mode   
## sigma2             1  -none-   numeric
## lambda             4  -none-   numeric
## phi               44  -none-   numeric
## xiEst          10060  -none-   numeric
## xiVar           2515  -none-   list   
## obsGrid           11  -none-   numeric
## workGrid          11  -none-   numeric
## mu                11  -none-   numeric
## smoothedCov      121  -none-   numeric
## FVE                1  -none-   numeric
## cumFVE             4  -none-   numeric
## fittedCov        121  -none-   numeric
## fittedCorr       121  -none-   numeric
## optns             35  -none-   list   
## bwMu               0  -none-   NULL   
## bwCov              0  -none-   NULL   
## inputData          2  -none-   list   
## selectK            1  -none-   numeric
## criterionValue     1  -none-   numeric
## timings            4  difftime numeric

Eigenvalues are stored in lambda. As in ordinary PCA, they show how much variance is explained by each FPC. In this case, FPC1 explains quite a lot of the variance in the data, so we might only need to look at FPC1 to understand the overall trend in the F2 dynamics.

# eigenvalues
df_dyn_fpca$lambda
## [1] 73.431312 13.589246  4.373711  1.285172

The cumulative proportion of variance is stored in cumFVE.

# the cumulative percentage of variance explained by the eigenvalue
df_dyn_fpca$cumFVE
## [1] 0.7859692 0.9314212 0.9782350 0.9919908

And as we did for the ordinary PCA, we can calculate the proportion of variance explained from the eigenvalues (lambda). Note that the manual calculation below is per-component rather than cumulative, and its values differ slightly from the output of df_dyn_fpca$cumFVE, but you can see that the overall trend is still quite similar.

# calculating proportion of variance from eigenvalues
fpca_var_exp <- df_dyn_fpca$lambda / sum(df_dyn_fpca$lambda)

# compare this with cumFVE
fpca_var_exp
## [1] 0.79231501 0.14662633 0.04719182 0.01386685
df_dyn_fpca$cumFVE
## [1] 0.7859692 0.9314212 0.9782350 0.9919908
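To get a cumulative version of the manual calculation, wrap the proportions in cumsum(). A toy sketch with made-up eigenvalues of a similar magnitude to ours:

```r
lambda <- c(73.4, 13.6, 4.4, 1.3)  # toy eigenvalues (not the fitted values)
prop <- lambda / sum(lambda)       # per-component proportion of variance
cumsum(prop)                       # cumulative proportion; ends at 1
```

Unlike this manual version, cumFVE does not reach exactly 1 after the selected components, because fdapace’s fraction of variance explained is computed against the total variance in the data, which can include components beyond the retained ones.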

The dynamic analysis introduces the time dimension. df_dyn_fpca$workGrid gives you the time points at which data are sampled.

# list of sampling time
df_dyn_fpca$workGrid
##  [1]   0  10  20  30  40  50  60  70  80  90 100

PC scores are stored in df_dyn_fpca$xiEst. Each row is one token, and each column corresponds to one FPC dimension.

# PC scores 
head(df_dyn_fpca$xiEst)
##            [,1]     [,2]       [,3]        [,4]
## [1,]  1.3978292 5.260927  2.1324070 -2.52256410
## [2,] -1.5626684 5.595501  0.2863422 -1.84886236
## [3,] -0.6166366 4.291694  1.1572724 -1.35058648
## [4,] -5.5315817 2.679823  0.1415474  0.08740263
## [5,]  0.5428989 3.542994  3.8331163 -2.47430535
## [6,] -2.7433146 7.071773 -2.1509403 -1.88748022

Finally, you can get a mean curve directly by applying GetMeanCurve and a scree plot with CreateScreePlot in the fdapace package. CreatePathPlot returns a plot showing individual trajectories.

# plot
plot(df_dyn_fpca)

# Mean curve
fdapace::GetMeanCurve(Ly = input_df$Ly, Lt = input_df$Lt, optns = list(plot = TRUE))

## $mu
##  [1] -0.70831425 -0.56725924 -0.32739985 -0.04873545  0.18428169  0.31887341
##  [7]  0.37257545  0.37071962  0.31025622  0.17889910 -0.08389668
## 
## $workGrid
##  [1]   0  10  20  30  40  50  60  70  80  90 100
## 
## $bwMu
## NULL
## 
## $optns
## $optns$userBwMu
## [1] 5
## 
## $optns$methodBwMu
## [1] "Default"
## 
## $optns$userBwCov
## [1] 10
## 
## $optns$methodBwCov
## [1] "Default"
## 
## $optns$kFoldMuCov
## [1] 10
## 
## $optns$methodSelectK
## [1] "FVE"
## 
## $optns$FVEthreshold
## [1] 0.99
## 
## $optns$FVEfittedCov
## NULL
## 
## $optns$fitEigenValues
## [1] FALSE
## 
## $optns$maxK
## [1] 9
## 
## $optns$dataType
## [1] "Dense"
## 
## $optns$error
## [1] TRUE
## 
## $optns$shrink
## [1] FALSE
## 
## $optns$nRegGrid
## [1] 11
## 
## $optns$rotationCut
## [1] 0.25 0.75
## 
## $optns$methodXi
## [1] "IN"
## 
## $optns$kernel
## [1] "epan"
## 
## $optns$lean
## [1] FALSE
## 
## $optns$diagnosticsPlot
## NULL
## 
## $optns$plot
## [1] TRUE
## 
## $optns$numBins
## NULL
## 
## $optns$useBinnedCov
## [1] TRUE
## 
## $optns$usergrid
## [1] FALSE
## 
## $optns$yname
## [1] "Ly"
## 
## $optns$methodRho
## [1] "vanilla"
## 
## $optns$verbose
## [1] FALSE
## 
## $optns$userMu
## NULL
## 
## $optns$userCov
## NULL
## 
## $optns$methodMuCovEst
## [1] "cross-sectional"
## 
## $optns$userRho
## NULL
## 
## $optns$userSigma2
## NULL
## 
## $optns$outPercent
## [1] 0 1
## 
## $optns$useBinnedData
## [1] "AUTO"
## 
## $optns$useBW1SE
## [1] FALSE
## 
## $optns$imputeScores
## [1] TRUE
# scree plot
fdapace::CreateScreePlot(df_dyn_fpca)

# path plot
fdapace::CreatePathPlot(df_dyn_fpca, xlab = "normalised time", ylab = "F2 (z-normalised)")

Understanding variation captured by FPCs

We have seen that the FPCA identifies FPC1 as capturing the majority of the variation observed in the data. Let’s look at this in more detail. The code below is from Strycharczuk et al. (2024) – see the file “diphthongisation_paper.html” stored in the repository.

The code below lets you visualise what variation is captured by each FPC, by adding standard deviations to, and subtracting them from, the mean curve.
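The arithmetic behind this is simple: the curve perturbed by Q standard deviations along FPC k is mu(t) + Q * sqrt(lambda_k) * phi_k(t). A minimal numeric sketch with made-up values (not our fitted FPCA object):

```r
# toy values at 3 time points (not from the workshop data)
mu     <- c(0.0, 0.5, 1.0)   # mean curve
phi1   <- c(0.8, 0.1, -0.6)  # eigenfunction for FPC1
lambda <- 4                  # eigenvalue (variance along FPC1)
Q      <- 1                  # perturb by 1 SD

mu + Q * sqrt(lambda) * phi1  # +1 SD curve:  1.6  0.7 -0.2
mu - Q * sqrt(lambda) * phi1  # -1 SD curve: -1.6  0.3  2.2
```

The same building blocks reappear when we reconstruct individual tokens later on: there, a token’s own FPC score takes the place of Q * sqrt(lambda).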

# function: get PC scores + return data frame with PCs for each token
get_pc_scores <- function(fpcaObj){
  pcs <- data.frame(fpcaObj$xiEst)
  token <- names(fpcaObj$inputData$Lt) 
  df <- cbind(token, pcs)
  n_pcs <- length(fpcaObj$lambda) # get number of PCs
  pc_names <- paste0("PC", 1:n_pcs) # create colnames for PCs
  names(df) <- c("file", pc_names) # add colnames for token + PCs
  return(df)
}

# get PC scores w/ token info
pc_df <- get_pc_scores(df_dyn_fpca)

# join PCs (dat) with selected cols from original data frame 
## store meta info
meta <- df_dyn |>  
  select(speaker, gender, language, word, liquid, context, file)

## merge the list and meta data - unique(meta) because otherwise there would be lots of duplicates
dat <- left_join(pc_df, unique(meta), by = "file")

# function: define perturbation function (±Q = ±sd, k = PC number)
perturbation <- function(fpcaObj, Q, k){
  Q * sqrt(fpcaObj$lambda[k]) * fpcaObj$phi[,k] + fpcaObj$mu
}

# function: create perturbation object with mean and ±Q sd as a data frame (for one PC only)
# can validate against fdapace::GetMeanCurve and fdapace::CreateModeOfVarPlot
perturbation_object <- function(fpcaObj, Q, k){
  time <- fpcaObj$workGrid # grid of time values
  mean <- fpcaObj$mu # mean trajectory
  Qplus <- perturbation(fpcaObj, Q, k) # +Q sd
  Qminus <- perturbation(fpcaObj, -Q, k) # -Q sd
  df <- cbind(time, mean, Qplus, Qminus)
  colnames(df) <- c("time", "mean", "Qplus", "Qminus")
  df <- data.frame(df)
  df$PC <- paste0("PC", k) # add PC colname
  return(df)
}

# function: create perturbation data frame with mean and ±Q sd (for all PCs)
# to do: add ability to pass list of Q values for gradient perturbation function
get_perturbation <- function(fpcaObj, Q){
  n_pcs <- length(fpcaObj$lambda)
  k <- 1:n_pcs
  df <- lapply(k, perturbation_object, fpcaObj=fpcaObj, Q=Q)
  df <- dplyr::bind_rows(df) # unnest lists into single df
  return(df)
}

# get mean trajectory and ±2 sd for all PCs
Q <- seq(from = -2, to = 2, by = 0.1)
pQ <- lapply(Q, get_perturbation, fpcaObj = df_dyn_fpca)
names(pQ) <- Q # name pQ lists using value of Q
pQ <- dplyr::bind_rows(pQ, .id = "Q") # collapse lists together
pQ$Q <- as.numeric(pQ$Q) # make 'Q' column numeric

# visualise variation along each FPC
pQ |> 
  dplyr::filter(PC %in% c('PC1','PC2', 'PC3', 'PC4')) |> 
  dplyr::mutate(
    PC = case_when(
      PC == "PC1" ~ "FPC1",
      PC == "PC2" ~ "FPC2",
      PC == "PC3" ~ "FPC3",
      PC == "PC4" ~ "FPC4"
    )
  ) |> 
  tidyr::pivot_longer(Qplus:Qminus, names_to = "Qsd", values_to = "value") |>
  ggplot2::ggplot() +
  aes(x = time, y = value, colour = Q, group = interaction(Q, Qsd)) +
  geom_path() +
  facet_wrap(~ PC, ncol = 2) +
  scale_colour_gradient2(low = "#E69F00", mid = "#56B4E9", high = "#009E73", midpoint = 0)+
  labs(x = "Time (normalised)", y = "Reconstructed F2 values", color = "PC score")

Remember that FPC1 explains 79.23% of the variance in the data. FPC1 shows that the variation is larger towards the end of the interval, which seems to suggest that FPC1 captures the difference in the shape and height of F2 trajectories depending on the vowel context. FPC2, on the other hand, shows a greater variation closer to the onset of the interval, which might correspond to the difference between English /l/ and /ɹ/.

Reconstructing F2 trajectories based on FPCs

We can reconstruct the F2 trajectories using the information captured by each FPC to better understand the dimension captured by FPC1.

working out necessary parameters

Let’s first work out parameters required for reconstructing the original PC1 trajectories. This includes: (1) mean curve, (2) time points used to fit FPCA, (3) eigenfunction associated with each time point and (4) PC loadings for each PC. These are stored in the fdapace::FDA object that we obtained from the FPCA analysis earlier. The following code computes eigenfunctions manually.

# mean F2 trajectory
# (equivalently: fdapace::GetMeanCurve(Ly = input_df$Ly, Lt = input_df$Lt, optns = list(plot = TRUE)))
mu_values <- data.frame(df_dyn_fpca$mu) # mean curve values
mu_time <- data.frame(df_dyn_fpca$workGrid) # time points used for estimating the curve
phi <- data.frame(df_dyn_fpca$phi) # eigenfunction at each time point: workGrid x n_lambda (here 44 = 11 time points * 4 FPCs)
lambda <- data.frame(df_dyn_fpca$lambda) # eigenvalues, one per FPC (currently 4)

# create a data frame containing mean curve, time and eigenfunctions associated with each PC at each time point
## add an extra column 'col_number' as a common index across the data frames - useful when merging everything together later on
### mean curve
mu_values <- mu_values |>  
  dplyr::mutate(
    col_number = row_number()
  )

### sampling time points
mu_time <- mu_time |>  
  dplyr::mutate(
    col_number = row_number()
  )

### eigenfunction
phi <- phi |>  
  dplyr::mutate(
    col_number = row_number()
  )

### pc loadings
lambda <- lambda |>  
  dplyr::mutate(
    PC = str_c("PC", row_number()),
    PC = str_c(PC, "lambda", sep = "_")
  ) |>  
  tidyr::pivot_wider(names_from = "PC", values_from = "df_dyn_fpca.lambda") |>  
  dplyr::slice(rep(1:n(), each = 51)) |>  # NB: 51 is more than needed here; only the rows matching the 11 workGrid points survive the merge below
  dplyr::mutate(
    col_number = row_number()
  )
  
## merging all data together one by one
rec <- dplyr::left_join(mu_values, mu_time, by = "col_number")
rec <- dplyr::left_join(rec, phi, by = "col_number")
rec <- dplyr::left_join(rec, lambda, by = "col_number")

## tidying up some column names
rec <- rec |>  
  dplyr::select(col_number, df_dyn_fpca.workGrid, df_dyn_fpca.mu, X1, X2, X3, X4, PC1_lambda, PC2_lambda, PC3_lambda, PC4_lambda) |>  
  dplyr::rename(
    mean = df_dyn_fpca.mu,
    time = df_dyn_fpca.workGrid,
    PC1_eigen = X1,
    PC2_eigen = X2,
    PC3_eigen = X3,
    PC4_eigen = X4
  )

## plotting the eigenfunctions - this should match the bottom-right sub-plot created with plot(df_dyn_fpca)
rec |>  
  ggplot() +
  # geom_path(aes(x = time, y = mean)) +
  geom_path(aes(x = time, y = PC1_eigen), colour = "black", linewidth = 1.5) +
  geom_path(aes(x = time, y = PC2_eigen), colour = "red", linetype = 2, linewidth = 1.5) +
  geom_path(aes(x = time, y = PC3_eigen), colour = "darkgreen", linetype = 3, linewidth = 1.5) +
  # geom_path(aes(x = time, y = value, colour = pc)) +
  geom_hline(yintercept = 0) +
  labs(x = "time", y = "eigenfunctions", title = "First 3 eigenfunctions")

## check if this matches plot(df_dyn_fpca)
plot(df_dyn_fpca)

Let’s now merge the FPCA parameters with the meta data.

# PC scores -> each row is 1 token, each column is one PC
head(df_dyn_fpca$xiEst)
##            [,1]     [,2]       [,3]        [,4]
## [1,]  1.3978292 5.260927  2.1324070 -2.52256410
## [2,] -1.5626684 5.595501  0.2863422 -1.84886236
## [3,] -0.6166366 4.291694  1.1572724 -1.35058648
## [4,] -5.5315817 2.679823  0.1415474  0.08740263
## [5,]  0.5428989 3.542994  3.8331163 -2.47430535
## [6,] -2.7433146 7.071773 -2.1509403 -1.88748022
# PC scores have already been added to the main data set
head(dat)
##                     file        PC1      PC2        PC3         PC4 speaker
## 1 EN_5jzj2h_lamb015_0001  1.3978292 5.260927  2.1324070 -2.52256410  5jzj2h
## 2 EN_5jzj2h_lamb046_0001 -1.5626684 5.595501  0.2863422 -1.84886236  5jzj2h
## 3 EN_5jzj2h_lamb077_0001 -0.6166366 4.291694  1.1572724 -1.35058648  5jzj2h
## 4 EN_5jzj2h_lamb108_0001 -5.5315817 2.679823  0.1415474  0.08740263  5jzj2h
## 5 EN_5jzj2h_lamb139_0001  0.5428989 3.542994  3.8331163 -2.47430535  5jzj2h
## 6 EN_5jzj2h_lamp033_0001 -2.7433146 7.071773 -2.1509403 -1.88748022  5jzj2h
##   gender language word liquid context
## 1   Male  English lamb      L     /æ/
## 2   Male  English lamb      L     /æ/
## 3   Male  English lamb      L     /æ/
## 4   Male  English lamb      L     /æ/
## 5   Male  English lamb      L     /æ/
## 6   Male  English lamp      L     /æ/
# duplicate each row 51 times
# (NB: our workGrid has only 11 points, so after the merge below the rows with
# col_number 12-51 get NA values and are later dropped by ggplot with a warning;
# 'each = 11' would avoid this)
dat_time <- dat |> 
  dplyr::slice(rep(1:n(), each = 51))

# add col_names to merge with the other data frame
dat_time <- dat_time |>  
  dplyr::group_by(file) |>  
  dplyr::mutate(
    col_number = row_number()
  ) |>  
  ungroup()

# merge
dat_time <- left_join(dat_time, rec, by = "col_number")

Reconstructed individual F2 trajectories based on FPC1

Finally, here are the individual F2 trajectories reconstructed from the FPC1 scores.

# reconstruct individual trajectory tokens
dat_time <- dat_time |>  
  dplyr::mutate(
    PC1_reconstruct = PC1 * PC1_eigen + mean,
    PC2_reconstruct = PC2 * PC2_eigen + mean,
    PC3_reconstruct = PC3 * PC3_eigen + mean,
    PC4_reconstruct = PC4 * PC4_eigen + mean
  )

# visualise FPC1
dat_time |>  
  ggplot() +
  geom_path(aes(x = time, y = PC1_reconstruct, group = file, colour = context), alpha = 0.2, show.legend = TRUE) +
  scale_color_manual(values = c("#000000", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")) +
  labs(x = "Proportional Time (%)", y = "F2", title = "Reconstructed F2 from FPC1") +
  geom_hline(yintercept = 0, linetype = 1, linewidth = 0.1) +
  guides(colour = guide_legend(override.aes = list(alpha = 1))) +
  facet_grid(liquid ~ language)
## Warning: Removed 100600 rows containing missing values or values outside the scale range
## (`geom_path()`).

FPC1 does indeed seem to capture variation associated with vowel context. The visualisation suggests that L1 English speakers show less variation, especially for English /ɹ/. It is also quite evident that L1 Japanese speakers do not seem to differentiate F2 trajectories between /æ/ and /u/, which is rather surprising.

Reconstructed individual F2 trajectories based on FPC2

What does the F2 trajectory look like when reconstructed from FPC2?

# visualise FPC2
dat_time |>  
  ggplot() +
  geom_path(aes(x = time, y = PC2_reconstruct, group = file, colour = context), alpha = 0.2, show.legend = TRUE) +
  scale_color_manual(values = c("#000000", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")) +
  labs(x = "Proportional Time (%)", y = "F2", title = "Reconstructed F2 from FPC2") +
  geom_hline(yintercept = 0, linetype = 1, linewidth = 0.1) +
  guides(colour = guide_legend(override.aes = list(alpha = 1))) +
  facet_grid(liquid ~ language)
## Warning: Removed 100600 rows containing missing values or values outside the scale range
## (`geom_path()`).

OK, so there’s something interesting here: we speculated earlier that FPC2 may be capturing the difference between English /l/ and /ɹ/, but it doesn’t seem to be the case. Rather, FPC2 seems to capture the degree of vocalic anticipatory coarticulation that can be observed during the liquid interval.

Reconstructed individual F2 trajectories based on FPC1+FPC2

Finally, it is also possible to account for the joint contribution of FPC1 and FPC2 by summing their terms together:

# reconstruct individual trajectory tokens
dat_time <- dat_time |>  
  dplyr::mutate(
    joint_PC1_PC2 = (PC1 * PC1_eigen) + (PC2 * PC2_eigen) + mean
  )

# visualise FPC1+FPC2
dat_time |>  
  ggplot() +
  geom_path(aes(x = time, y = joint_PC1_PC2, group = file, colour = context), alpha = 0.2, show.legend = TRUE) +
  scale_color_manual(values = c("#000000", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")) +
  labs(x = "Proportional Time (%)", y = "F2", title = "Reconstructed F2 from FPC1+FPC2") +
  geom_hline(yintercept = 0, linetype = 1, linewidth = 0.1) +
  guides(colour = guide_legend(override.aes = list(alpha = 1))) +
  facet_grid(liquid ~ language)
## Warning: Removed 100600 rows containing missing values or values outside the scale range
## (`geom_path()`).